ABSTRACT

Twitter is becoming the new big data container, owing to the sharp increase in its number of users and its growing popularity. Moreover, the huge amount of user profiles and raw text data continuously poses new research challenges.

This paper reports our contribution to, and results in, the TREC 2012 Microblog Track. In this challenge each participant is required to conduct a "real-time" retrieval task which, given a query topic, seeks the most recent and relevant tweets. We devised an effective real-time ranking algorithm that avoids heavy computational requirements. Our contribution is multifold: (1) adapting an existing ranking method, BM25, to microblogging; (2) enhancing traditional content-based features with knowledge extracted from Wikipedia; (3) employing Pseudo Relevance Feedback techniques for query expansion; (4) performing text analysis, such as ad-hoc text normalization and POS tagging, to limit noisy data and better represent useful information.

1.  INTRODUCTION 

In the last year the huge interest in the microblogging platform Twitter has led to large amounts of short text messages being exchanged between users every day. These short messages, called "tweets", provide information ranging from job opportunities, personal opinions and newspaper articles to the author's sentiments. The research community is now focused on different aspects, such as limiting irrelevant or noisy information, employing ad-hoc POS taggers and Named Entity Recognition tools, or applying predictive analysis. In June 2012 the traffic reached a rate of 400 million tweets per day. Consequently, exploring the behavior and the relevant features of the microblog sphere requires considerable work in applying different linguistic analysis techniques and data analysis methods. Our interest in microblogs and social networks led us to participate in the TREC 2012 Microblog Track, which is based on a Twitter corpus. The corpus covers 2 weeks, from Jan. 23rd to Feb. 8th 2011, for which TREC identified 60 query topics.

2.  MICROBLOG  AD-HOC  TASK 

2.1  Task 

This is the second year that the TREC conference has run a task in the Microblog Track. The first challenge consists in finding and ranking relevant tweets given a query topic. The task aims to simulate the behaviour of users who want to retrieve information on Twitter using a few keywords. Basically, the goal is to retrieve the tweets most relevant to the given keywords and discard the irrelevant ones. In the evaluation, all retrieved documents are ordered by recency and then evaluated by traditional metrics, such as Precision@30, R-precision, MAP and the ROC curve. The judgment is based on a 3-level scale: non-relevant, relevant and highly relevant. In TREC 2012, only highly relevant tweets are taken into consideration.
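As an illustration of the metrics named above, the following minimal Python sketch computes Precision@k, R-precision and average precision for a single topic (MAP is the mean of the average precision over all topics). The function names and the toy data are ours and are not part of the official TREC evaluation tooling.

```python
def precision_at_k(ranked_ids, relevant, k=30):
    """Fraction of the first k retrieved documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


def r_precision(ranked_ids, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked_ids, relevant, r) if r else 0.0


def average_precision(ranked_ids, relevant):
    """Mean of the precision values measured at each relevant hit."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Toy example: four retrieved tweets, two of them judged relevant.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p_at_2 = precision_at_k(ranked, relevant, k=2)
r_prec = r_precision(ranked, relevant)
ap = average_precision(ranked, relevant)
```

Dividing by k even when fewer than k documents are retrieved follows the usual convention that missing results count as non-relevant.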

2.2  Dataset 

The corpus available for testing corresponds to the one used in 2011, which consists of approximately 16 million documents gathered between 24-01-2011 and 08-02-2011. However, we were able to collect only around 15 million tweets, because of protected users or tweet deletion. Every document contains much information useful for research, about both the tweet and the user profile. We collected the tweets by means of a web crawler and extracted several features from the raw HTML source, such as the URL, the number of retweets, whether the tweet is a reply within a conversation, the location and so on. At the end of the retrieval phase, two English language recognizers were employed: TextCat¹ and Apache Tika². All retweets were also discarded, in accordance with the TREC Microblog guidelines. The final corpus consists of 5 million tweets.

3.  THE  RETRIEVAL  SYSTEM 

Our approach is based on the idea that current ranking algorithms, based on the tf-idf weighting scheme and keyword matching, can serve as a baseline for a microblog ranking system. Nevertheless, this approach leaves ample room for improvement once the peculiar characteristics of the domain are considered.

We built our system entirely on the open-source information retrieval system Apache Lucene, which stores and indexes the documents. As Figure 1 shows, the first processing step of our approach is a ranking system based on the BM25 formula, implemented in Java for Lucene³, and a Microblog Score that is explained in Section 3.2. This first step also retrieves the top-k relevant tweets, which feed the query expansion process used to improve the accuracy of the ranking system.

¹ http://textcat.sourceforge.net/

² http://tika.apache.org/

Figure 1: The Proposed Retrieval and Ranking System

3.1  Revised-BM25

Our ranking function is a variant of the standard BM25 ranking function. It differs from standard BM25 in the computation of the term weights and in the normalization of the length weight: in both calculations we use the inverse function instead of the normal weight. Given a query q composed of the query terms {q_1, ..., q_n}, the score assigned to a document D is:

$$ BM25(q, D) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{tf(q_i, D) \cdot (k + 1)}{tf(q_i, D) + k \cdot (1 - b + b \cdot LEN)} \tag{1} $$

where k and b are weight parameters set to k = 2 and b = 0.67⁴, and avgl(collection) is the average document length in the collection. LEN is the length normalization weight, adjusted for short texts and therefore for Twitter. We assign a higher score to longer tweets:

$$ LEN = \frac{LENGTH(D)}{avgl(collection)} \tag{2} $$

tf is the term frequency and IDF is the inverse document frequency, calculated as follows:

$$ IDF(q_i) = \log \frac{N - docFreq + 0.5}{docFreq + 0.5} \tag{3} $$

where N is the number of documents in the collection and docFreq is the document frequency of the query word. Although Lucene provides high performance for indexing and searching, it was hard to adapt it to the revised BM25 algorithm, because document frequencies and other derived statistics are deeply encoded in the Lucene system. Further algorithms, which analyze specific social features extracted from the tweets and are combined with the Revised-BM25 ranking function, are discussed in the next sections.

³ The first version of our Java implementation of the BM25 ranking algorithm for Lucene is available for download at http://code.google.com/p/bm25-lucene-score/.

⁴ These values were obtained from preliminary tests.

3.2  Micro-Score  system

As a feature of microblogs, the limit on text length forces users to write concisely. Sometimes this means that tweets do not directly include the relevant content, but contain references or other microblog-specific features that stand for it. Table 1 summarizes statistics of some relevant features in tweets judged non-relevant, relevant or highly relevant in the 2011 Twitter corpus.

Feature               | non relevant | relevant | high relevant
avg length (tokens)   | 8.58         | 10.74    | 10.21
avg # of retweets     | 8.91         | 0.34     | 0.41
avg URL presence      | 52%          | 76%      | 94%
avg # of replies      | 0.12         | 0.04     | 0.02
avg # of hashtags     | 0.25         | 0.19     | 0.21
% of query terms      | 15%          | 40%      | 43%
avg noise terms       | 8%           | 3%       | 2%

Table 1: General statistics of Twitter features

The share of noise terms (discussed below) clearly decreases as the relevance of the tweet increases, while URLs occur more often in highly relevant tweets. Based on these statistics we propose the generalized formula:

$$ Mscore(T) = \alpha \cdot url + \beta \cdot \log(rt) + \gamma \cdot reply + \delta \cdot \log(noisetxt) \tag{4} $$

where the features are linearly combined to alter the traditional ranking function. The first element, url, records the presence of URLs in a tweet. Through URLs users can include much extra information, such as images, videos or related newspaper articles. Nonetheless, we argue that analyzing every URL and the page it points to is not suitable for real-time ranking. Thus we decided to include only the feature that indicates whether a URL is present in the tweet.

Another microblog feature included in the formula is the number of retweets, rt. Preliminary experiments on a small corpus, used to assign a weight to each feature, show that it is not a very reliable indicator.

In the next step we classify the tweets into two classes:

•  conversational tweets

•  direct tweets

Tweets fall into the former category if they are a reply to another user's tweet or if they are retweets. In this case we assign zero weight to the reply feature, and 1 otherwise.

The last feature, called noise terms or noise text, identifies tweets that contain noisy information. Our hypothesis is that a relevant tweet contains well-written text, e.g. text that is syntactically correct.

Therefore we extracted a subset of tweets representing a noise text class:

•  Repeated letters: e.g. looovee, soooo much

•  Alphanumeric words: e.g. 2night, 4ever, str8

•  Strong presence of smileys: e.g. 0_0, :)

To recognize such dirty text, we employed regular expressions. Examples are given below:

Repeated letters: .*([a-zA-Z])\\1{2}.*

Alphanumeric words: [0-9]{1}[a-zA-Z]+ or [a-zA-Z]+[0-9]{1}[a-zA-Z]*

The outcome is a value close to 1 if the tweet contains a high level of syntactically incorrect content, and a value close to 0 if the tweet does not contain (under our hypothesis) noise text.
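A minimal Python sketch of this noise detection follows. The repeated-letter and alphanumeric patterns are adapted from the expressions above; the emoticon pattern and the per-token ratio used as the noise value are assumptions made for illustration, since the exact computation is not specified here.

```python
import re

# Adapted from the expressions quoted above (Python syntax, single backslash).
REPEATED_LETTER = re.compile(r"([a-zA-Z])\1{2}")
ALPHANUMERIC = re.compile(r"[0-9][a-zA-Z]+|[a-zA-Z]+[0-9][a-zA-Z]*")
# Hypothetical emoticon pattern: the text gives examples but no expression.
SMILEY = re.compile(r"[:;=][-']?[()DPp]|[0Oo]_[0Oo]")


def is_noisy(token: str) -> bool:
    """True if the token looks like one of the three noise categories."""
    return bool(
        REPEATED_LETTER.search(token)
        or ALPHANUMERIC.fullmatch(token)
        or SMILEY.fullmatch(token)
    )


def noise_ratio(text: str) -> float:
    """Assumed noise value: share of whitespace-separated tokens flagged as noisy."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(is_noisy(token) for token in tokens) / len(tokens)


clean = noise_ratio("identity theft protection")
dirty = noise_ratio("looovee 2night str8 :)")
```

With this definition a clean tweet scores 0 and a tweet made only of noisy tokens scores 1, which matches the range described above.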

We then computed the Mscore formula by assigning values to the parameters α, β, γ and δ.

The linear combination of Mscore and Revised-BM25 is our baseline algorithm for ranking tweets.
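The baseline can be sketched in Python as follows. This is an illustrative sketch, not our Java/Lucene implementation: it assumes the standard BM25 shape for Equation 1, with LEN taken from Equation 2 and IDF from Equation 3; it replaces log(x) by log(1 + x) so that tweets with zero retweets or zero noise keep a finite score; and it uses hypothetical values for the weights alpha, beta, gamma, delta and for the mixing parameter lam, none of which are reported above.

```python
import math
from collections import Counter


def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency as in Equation 3."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def revised_bm25(query, doc_tokens, doc_freqs, n_docs, avg_len, k=2.0, b=0.67):
    """BM25-style score of one tweet; the exact functional form is assumed."""
    tf = Counter(doc_tokens)
    length_norm = len(doc_tokens) / avg_len  # LEN, Equation 2
    score = 0.0
    for term in query:
        if tf[term] == 0:
            continue  # terms absent from the tweet contribute nothing
        denom = tf[term] + k * (1.0 - b + b * length_norm)
        score += idf(n_docs, doc_freqs.get(term, 0)) * tf[term] * (k + 1.0) / denom
    return score


def mscore(has_url, retweets, is_conversational, noise,
           alpha, beta, gamma, delta):
    """Linear combination of the microblog features of Equation 4.

    log1p replaces log (an assumption) to avoid log(0).
    """
    reply = 0.0 if is_conversational else 1.0  # 0 for replies and retweets
    return (alpha * float(has_url)
            + beta * math.log1p(retweets)
            + gamma * reply
            + delta * math.log1p(noise))


def baseline_score(bm25: float, micro: float, lam: float = 0.5) -> float:
    """Hypothetical convex combination of the two scores."""
    return lam * bm25 + (1.0 - lam) * micro


# Toy usage with made-up collection statistics and weights.
tweet = ["identity", "theft", "protection"]
bm25 = revised_bm25(["theft"], tweet, {"theft": 10}, n_docs=1000, avg_len=3.0)
micro = mscore(True, 0, False, 0.0, alpha=1.0, beta=1.0, gamma=1.0, delta=-1.0)
final = baseline_score(bm25, micro)
```

A negative delta is used in the toy call so that noisier tweets score lower; the sign and magnitude of each weight would have to be tuned on held-out judgments.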

4.  QUERY  EXPANSION 

After a deep analysis of the microblog dataset, we discovered that some query terms are not able to retrieve all the relevant tweets. Basically, recall suffers from the mismatch between the query keywords and the terms in the tweets (see the well-known vocabulary problem [6]). To improve retrieval recall we set up a fully automatic query expansion module. Starting from the top-15 documents ranked by our system, we follow two query expansion steps:

1.  Wikipedia  Topic-Entity  Expansion

2.  Noun-Adjective  Pseudo  Relevance  Feedback

The following sections describe the steps in detail.

4.1  Wikipedia  Topic-Entity  Expansion 

In this first step of query expansion we used the Wikify service⁵ included in the Wikipedia Miner project of the University of Waikato. This service automatically detects the topics mentioned in a given document, and also provides the probability that each detected topic is correct. Hence, we extracted the topics with the best likelihood from the top-15 ranked tweets and built a vector representation for the semantic analysis.

We then analyze the semantic relatedness between the extracted topics and the original query terms using the Compare service⁶ of Wikipedia Miner. Figure 2 shows an example of how semantic relatedness works in Wikipedia Miner. In short, the relatedness measures are calculated from the links going into and out of each page, where the page is the Wikipedia page representing the term (if available). Links that are common to both pages are used as evidence that they are related, while links that are unique to one or the other indicate the opposite. For query expansion we assume that if an extracted Wikipedia topic and a query term have a high semantic similarity, the topic is a good candidate for expanding the query. The principal benefit of this approach is that, given an ambiguous term such as NSA, we are able to obtain its potential meanings (e.g., National Security Agency, Information Security).

⁵ A live demo is available at http://wikipedia-miner.cms.waikato.ac.nz/services/?wikify

⁶ A live demo is available at http://wikipedia-miner.cms.waikato.ac.nz/services/?compare
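The intuition behind the relatedness measure can be illustrated with a small Python sketch. It is a deliberate simplification: it scores two pages by the Jaccard overlap of their link sets, which is not the measure implemented by Wikipedia Miner, and the link sets, topic names and the 0.3 threshold are hypothetical.

```python
def link_overlap(links_a: set, links_b: set) -> float:
    """Jaccard overlap: shared links count for relatedness, unique links against."""
    union = links_a | links_b
    if not union:
        return 0.0
    return len(links_a & links_b) / len(union)


def select_topics(query_links: set, candidate_links: dict, threshold: float = 0.3):
    """Keep candidate topics whose overlap with the query page reaches the threshold."""
    scored = {topic: link_overlap(query_links, links)
              for topic, links in candidate_links.items()}
    kept = [topic for topic, score in scored.items() if score >= threshold]
    return sorted(kept, key=lambda topic: -scored[topic])


# Toy link sets (made up): pages linked from or to each Wikipedia page.
query_links = {"Crime", "Fraud", "Privacy", "Credit card"}
candidates = {
    "Fraud": {"Crime", "Privacy", "Credit card", "Bank"},
    "Cat": {"Pet", "Mammal"},
}
selected = select_topics(query_links, candidates)
```

In this toy case only the topic that shares most of its links with the query page survives the threshold and would be used for expansion.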


Figure  2:  Wikipedia  relation  between  cat  and  dog 

4.2  Noun-Adjective  Pseudo  Relevance  Feedback

The second step of query expansion analyzes the lexicon contained in the top-15 ranked tweets. To extract the expansion terms automatically we used the Pseudo Relevance Feedback approach. Basically, we selected the top-15 ranked documents and, using a Part-of-Speech tagger⁷, extracted the proper nouns (NNP) and adjectives found in them. Finally, we ranked those terms with the tf-idf weighting scheme, to better represent relevant terms. The entire process can be summarized as follows:

1.  Select the top-15 documents from the first retrieval

2.  Process all documents with the POS tagger

3.  Extract the nouns and adjectives

4.  Weight the extracted words with the tf-idf formula

5.  Select the top-3 extracted words for query expansion
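The five steps above can be sketched in Python as follows. The sketch assumes the tweets have already been POS-tagged (each tweet is simply a list of (token, tag) pairs), uses one common tf-idf variant, and drops terms that are already in the query; these choices, the function name and the toy data are illustrative assumptions.

```python
import math
from collections import Counter

# Penn Treebank tags kept for expansion: proper nouns and adjectives.
KEEP_TAGS = {"NNP", "JJ"}


def expansion_terms(tagged_docs, doc_freqs, n_docs, query_terms,
                    top_docs=15, top_terms=3):
    """Return the best tf-idf weighted candidate terms from the top documents."""
    tf = Counter()
    for doc in tagged_docs[:top_docs]:            # step 1: top-15 documents
        for token, tag in doc:                    # step 2: tags are precomputed
            term = token.lower()
            if tag in KEEP_TAGS and term not in query_terms:
                tf[term] += 1                     # step 3: keep nouns/adjectives

    def weight(term):                             # step 4: assumed tf-idf variant
        return tf[term] * math.log(n_docs / (1 + doc_freqs.get(term, 0)))

    ranked = sorted(tf, key=lambda term: (-weight(term), term))
    return ranked[:top_terms]                     # step 5: top-3 terms


# Toy data: three tagged tweets and made-up document frequencies.
tweets = [
    [("Fraud", "NNP"), ("is", "VBZ"), ("common", "JJ")],
    [("Fraud", "NNP"), ("Crime", "NNP")],
    [("identity", "NN"), ("Service", "NNP")],
]
freqs = {"fraud": 10, "crime": 20, "service": 500, "common": 900}
terms = expansion_terms(tweets, freqs, n_docs=1000,
                        query_terms={"identity", "theft", "protection"})
```

Ties are broken alphabetically so that the selection is deterministic.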

At the end of the Wikipedia and Pseudo Relevance Feedback processes, we combined the output of both modules by inserting the terms into the original query. If, for some reason, the terms are not available, the following step is skipped. To assign a weight to the expanded terms we used the Lucene boosting feature. Preliminary tests led us to set the weight of the original query terms to 1 and that of the expanded terms to 0.3.

An example of expansion is:

•  <QUERY TOPIC>: 108

•  <QUERY>: <identity theft protection>

•  <EXPANDED>: <identity theft protection Crime^0.3 Fraud^0.3 Service^0.3>

where all the query terms are combined by the OR operator. Thus, a possible query term vector is identity theft fraud or identity protection service.

⁷ We used the Stanford University POS Tagger.

Finally, the expanded query is given as input to the Revised-BM25 and Mscore retrieval system, in order to re-rank the documents.
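The expanded query of the example can be assembled as a Lucene query string, where term^boost sets the boost of a term and unboosted terms keep the default weight of 1. The helper below is a hypothetical Python sketch (the name build_expanded_query is ours); it assumes the query parser's default OR operator.

```python
def build_expanded_query(original_terms, expansion_terms, boost=0.3):
    """Join original terms (default boost 1) with boosted expansion terms."""
    boosted = []
    for term in expansion_terms:
        # Multi-word topics are quoted so the boost applies to the whole phrase.
        text = f'"{term}"' if " " in term else term
        boosted.append(f"{text}^{boost}")
    return " ".join(list(original_terms) + boosted)


query = build_expanded_query(
    ["identity", "theft", "protection"],
    ["Crime", "Fraud", "Service"],
)
```

Because the terms are only juxtaposed, the parser combines them with OR, so a tweet matching any subset of the terms can be retrieved and the boosts only affect its rank.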


Summary Statistics

Run ID:                                        AIRUN1
Task:                                          adhoc
Run type:                                      automatic
Collection crawl dates:                        december 2011
Number of 200/301 tweets in crawl:             15000000
Collection crawl indexed:                      HTML
Follows realtime constraints?                  yes
Uses documents linked from tweets?             no
Uses other external resources?                 yes
Number of Topics:                              59
Total number of documents over all topics
    Retrieved:                                 58982
    Relevant:                                  2572
    Relevant retrieved:                        1445
Mean average precision:                        0.1522
R-precision:                                   0.1930

Figure 3: Summary of AIRUN1


5.  EVALUATION 

In this section we present the official TREC 2012 evaluation for the ad-hoc task.

We submitted only one run to the microblog challenge. Our run, AIRUN1, uses no future evidence but employs external evidence such as Wikipedia Miner. Furthermore, the run is fully automatic, without manual interaction. Figure 3 shows some statistics about our run, such as MAP and R-precision. Table 2 shows the official results for highly relevant tweets computed by TREC, together with our own evaluation over all relevant results⁸.

Figure 4 shows, for each query topic, the difference between the P@30 of our run and the median P@30 of all runs. AIRUN1 outperforms the median on most query topics. However, in a few cases our system falls well below the median and our expectations. In particular, on topics 81, 85 and 89 our system scored almost zero for P@30 and MAP: the query expansion process was not able to extract relevant adjectives and nouns from the lexicon. Nevertheless, our system improves on the Lucene baseline score, in particular with the query expansion and Mscore modules. Table 3 summarizes the score of each module, from the Lucene baseline to the full system with the BM25+Mscore+QueryExpansion modules. Combining all the developed modules improves Precision@5 from 0.19 to 0.29.
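For reference, the relative gains implied by Table 3 and the recall implied by Figure 3 can be derived directly from the reported figures. The short Python sketch below only restates that arithmetic; the variable names are ours.

```python
# P@5, P@30 and MAP as reported in Table 3.
table3 = {
    "Lucene baseline": (0.19, 0.12, 0.08),
    "+ BM25 (Rev.)": (0.23, 0.14, 0.10),
    "+ Mscore": (0.27, 0.17, 0.13),
    "+ Q.E.": (0.29, 0.208, 0.15),
}


def relative_gain(old: float, new: float) -> float:
    """Improvement of new over old, as a fraction of old."""
    return (new - old) / old


baseline = table3["Lucene baseline"]
full_system = table3["+ Q.E."]
gains = {
    metric: relative_gain(old, new)
    for metric, old, new in zip(("P@5", "P@30", "MAP"), baseline, full_system)
}

# Recall from Figure 3: relevant retrieved over relevant.
recall = 1445 / 2572
```

On these figures the full system improves P@5 by roughly 53% over the Lucene baseline, and the run retrieves about 56% of the relevant tweets.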


⁸ The evaluation was performed using the same metrics released by TREC.


Run         | All Relevant (unofficial)  | High Relevant (official)
            | P@5    P@30    MAP         | R-prec   P@30   MAP
2012 AIRun  | 0.50   0.40    0.33        | 0.193    0.2    0.152

Table 2: 2012 High Relevant Official Results and All Relevant Unofficial Results


AIRUN1           | P@5  | P@30  | MAP
Lucene baseline  | 0.19 | 0.12  | 0.08
+ BM25 (Rev.)    | 0.23 | 0.14  | 0.10
+ Mscore         | 0.27 | 0.17  | 0.13
+ Q.E.           | 0.29 | 0.208 | 0.15

Table 3: High relevant score evaluated for every module of our approach

Figure 4: Difference, for each query topic, between median P@30 and AIRUN1 P@30 (horizontal axis: topic)

Figure 5: Difference from median average precision per topic


GROUP ID       | P@30   | GROUP ID        | P@30
HIT MTLAB      | 0.2701 | udel            | 0.196
ICTNET         | 0.2384 | Waterloo        | 0.1955
KobeU          | 0.2384 | NC SILS         | 0.1938
PKUICST        | 0.2333 | FUB             | 0.1932
CMU Callan     | 0.2333 | QCRI            | 0.1921
ot             | 0.2328 | UGENT IBCN SIS  | 0.1904
FASILKOMUI     | 0.2294 | qcri twitsear   | 0.1898
IBM            | 0.2254 | GUCAS           | 0.1876
uogTr          | 0.2232 | SCIAITeam       | 0.1808
uiucGSLIS      | 0.2186 | UvA             | 0.1774
PKUICST        | 0.2164 | BAU             | 0.174
UWaterlooMDS   | 0.2107 | udel fang       | 0.1616
york           | 0.2102 | csiro           | 0.1616
XMU PANCHAO    | 0.2023 | uog tw          | 0.1582
BUPT WILDCAT   | 0.2028 | HEIR            | 0.1508
AI ROMA3       | 0.1994 | UEdinburgh      | 0.1226
IRIT           | 0.1983 |                 |

Table 4: List of groupID in the Ad-hoc task sorted by best P@30

[10] J. Pérez-Iglesias. Integrating the probabilistic models BM25/BM25F into Lucene. Proceedings of the fourth ACM international conference on Web search, 2009.

[11] S. Ravikumar et al. Ranking tweets considering trust and relevance. ACM 2012, 2012.

[12] A. T. Rinkesh Nagmoti and M. D. Cock. Ranking approaches for microblog search. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2010.

[13] I. Soboroff and E. M. Voorhees. Overview of the TREC 2011 web track. Proceedings of TREC 2011, 2011.

6.  CONCLUSIONS 

In this paper we described our approach to the TREC 2012 Microblog Track. We propose an effective and straightforward real-time algorithm for ranking tweets that exploits traditional retrieval and semantic annotation tools. The evaluation confirms that the approach is able to rank highly relevant tweets with a Precision@30 of 0.2, which is higher than the median of all approaches.

As future work, it would be interesting to take into consideration users' interests, their older tweets and their friends in order to create a basic user profile. We also plan to improve our query expansion system and the Microblog Score. In particular, we are studying techniques for identifying authoritative and influential users for a specific topic. In our opinion this combined approach can improve the ranking system and limit issues related to spam documents. Finally, we will investigate better ways of discovering the dirty text class, and propose a Machine Learning approach to identify and automatically normalize this class of problems.